Balancing reliability, cost, and performance tradeoffs with FreeFault
Memory errors have been a major source of system failures, and fault rates may rise even further as memory continues to scale. This increasing fault rate, especially when combined with the advent of integrated on-package memories, may exceed the capabilities of traditional fault tolerance mechanisms or significantly increase their overhead. In this paper, we present FreeFault, a hardware-only, transparent, and nearly-free resilience mechanism that is implemented entirely within a processor and can tolerate the majority of DRAM faults. FreeFault repurposes portions of the last-level cache for storing retired memory regions and augments a hardware memory scrubber to monitor memory health and aid retirement decisions. Because it relies on existing structures (cache associativity) for retirement/remapping repair, FreeFault has essentially no hardware overhead. Because it requires a very modest portion of the cache (as small as 8KB) to cover a large fraction of DRAM faults, FreeFault has almost no impact on performance. We explain how FreeFault adds an attractive layer to an overall resilience scheme for highly reliable and highly available systems by delaying, and even entirely avoiding, calling upon software to make tradeoff decisions between memory capacity, performance, and reliability.
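As a rough illustration of the retirement flow described above, the following Python sketch pairs a scrubber-driven error counter with a remapping table into reserved last-level-cache lines. All names, thresholds, and sizes here are illustrative assumptions, not details from the paper.

```python
# Sketch of FreeFault-style retirement: regions of DRAM that the scrubber
# flags as weak are pinned into reserved last-level-cache lines so the host
# never touches the faulty cells again. Thresholds and sizes are assumed.

RETIRE_THRESHOLD = 2      # scrub errors before a region is retired (assumed)
RESERVED_LINES   = 128    # 128 x 64B lines = 8KB of LLC capacity

class FreeFaultSketch:
    def __init__(self):
        self.error_counts = {}   # region address -> observed scrub errors
        self.retired = {}        # region address -> reserved LLC line index

    def on_scrub_error(self, region):
        """Called by the hardware scrubber when it detects an error."""
        self.error_counts[region] = self.error_counts.get(region, 0) + 1
        if (self.error_counts[region] >= RETIRE_THRESHOLD
                and region not in self.retired
                and len(self.retired) < RESERVED_LINES):
            # Remap the weak region into a pinned cache line; no OS involvement.
            self.retired[region] = len(self.retired)

    def translate(self, region):
        """Memory-controller lookup: serve retired regions from the LLC."""
        line = self.retired.get(region)
        return ('llc', line) if line is not None else ('dram', region)
```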
DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis
Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to accelerate these layer workloads. GPU design optimization for efficient CNN training acceleration requires accurate modeling of how performance improves when computing and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each GPU memory hierarchy level while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust for different CNNs and GPU architectures. We then show how this model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.
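As a point of reference for what such a model must estimate, the sketch below computes only the compulsory (first-touch) DRAM traffic of a single convolution layer; DeLTA itself goes much further, modeling reuse at every level of the GPU memory hierarchy. All parameters are assumed for illustration.

```python
# Lower-bound DRAM traffic for one convolution layer: every input, weight,
# and output element must cross the memory interface at least once. Real
# traffic is higher and depends on cache/scratchpad reuse, which is exactly
# what an analytical model like DeLTA captures.

def compulsory_traffic_bytes(N, C, H, W, K, R, S, bytes_per_elem=4):
    """N=batch, C=input channels, HxW=feature map, K=output channels,
    RxS=filter size; 'same' padding assumed so outputs are also HxW."""
    inputs  = N * C * H * W       # each input element read at least once
    weights = K * C * R * S       # each weight read at least once
    outputs = N * K * H * W       # each output element written at least once
    return (inputs + weights + outputs) * bytes_per_elem

# Example: one ResNet-style 3x3 layer at batch size 32 (assumed shapes).
print(compulsory_traffic_bytes(N=32, C=64, H=56, W=56, K=64, R=3, S=3))
```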
The dual-path execution model for efficient GPU control flow
Current graphics processing units (GPUs) utilize the single instruction multiple thread (SIMT) execution model. With SIMT, a group of logical threads executes such that all threads in the group execute a single common instruction on a particular cycle. To enable control flow to diverge within the group of threads, GPUs partially serialize execution and follow a single control flow path at a time. The execution of the threads in the group that are not on the current path is masked. Most current GPUs rely on a hardware reconvergence stack to track the multiple concurrent paths and to choose a single path for execution. Control flow paths are pushed onto the stack when they diverge and are popped off the stack to enable threads to reconverge and keep lane utilization high. The stack algorithm guarantees optimal reconvergence for applications with structured control flow as it traverses the structured control-flow tree depth first. The downside of using the reconvergence stack is that only a single path is followed, which does not maximize available parallelism, degrading performance in some cases. We propose a change to the stack hardware in which the execution of two different paths can be interleaved. While this is a fundamental change to the stack concept, we show how dual-path execution can be implemented with only modest changes to current hardware and that parallelism is increased without sacrificing optimal (structured) control-flow reconvergence. We perform a detailed evaluation of a set of benchmarks with divergent control flow and demonstrate that the dual-path stack architecture is much more robust compared to previous approaches for increasing path parallelism. Dual-path execution either matches the performance of the baseline single-path stack architecture or outperforms single-path execution by 14.9% on average and by over 30% in some cases.
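To make the stack mechanics concrete, here is a minimal Python sketch of the baseline single-path reconvergence stack; the class name, method names, and mask encoding are illustrative assumptions, and the dual-path change is indicated only as a comment rather than reproducing the paper's hardware design.

```python
# Single-path SIMT reconvergence stack. Each entry is
# (pc, active_mask, reconvergence_pc); masks are bit vectors over lanes.

class ReconvergenceStack:
    def __init__(self, full_mask, start_pc=0):
        self.stack = [(start_pc, full_mask, None)]

    def diverge(self, taken_pc, fall_pc, taken_mask, reconv_pc):
        """Handle a divergent branch for the current top-of-stack path."""
        pc, mask, rpc = self.stack.pop()
        self.stack.append((reconv_pc, mask, rpc))   # entry the warp rejoins at
        if mask & ~taken_mask:                      # not-taken side, if any
            self.stack.append((fall_pc, mask & ~taken_mask, reconv_pc))
        if mask & taken_mask:                       # taken side, if any
            self.stack.append((taken_pc, mask & taken_mask, reconv_pc))
        # Dual-path change: let the scheduler issue from the top *two*
        # entries (the two sides of this divergence) instead of only one.

    def current(self):
        return self.stack[-1]       # single-path: only the TOS path issues

    def reconverge(self):
        self.stack.pop()            # executing path reached its reconv. PC
```

Because both sides of a divergence share one reconvergence PC, popping entries in stack order preserves the depth-first traversal that guarantees reconvergence for structured control flow.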
SecDDR: Enabling Low-Cost Secure Memories by Protecting the DDR Interface
The security goals of cloud providers and users include memory confidentiality and integrity, which require implementing replay-attack protection (RAP). RAP can be achieved using integrity trees or mutually authenticated channels. Integrity trees incur significant performance overheads and are impractical for protecting large memories. Mutually authenticated channels have been proposed only for packetized memory interfaces, which address only a very small niche domain and require fundamental changes to memory system architecture. We propose SecDDR, a low-cost RAP mechanism that targets direct-attached memories like DDRx. SecDDR avoids memory-side data authentication and thus adds only a small amount of logic to memory components and does not change the underlying DDR protocol, making it practical for widespread adoption. In contrast to prior mutual authentication proposals, which require trusting the entire memory module, SecDDR targets untrusted modules by placing its limited security logic on the DRAM die (or package) of the ECC chip. Our evaluation shows that SecDDR performs within 1% of an encryption-only memory without RAP, and that SecDDR provides 18.8% and 7.8% average performance improvements (up to 190.4% and 24.8%) relative to a 64-ary integrity tree and an authenticated channel, respectively.
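For background, the sketch below shows the generic shared-key MAC-plus-counter primitive that underlies replay protection on an authenticated channel. It is the textbook construction under assumed parameters (16-byte key, 8-byte truncated tags), not SecDDR's actual protocol; the paper's contribution is fitting a small amount of such logic onto the ECC chip without altering the DDR protocol.

```python
# Generic replay-attack protection building block: a shared-key MAC over
# (address, data, monotonic counter). Replaying an old (address, data, tag)
# triple fails verification because the counter has since advanced.

import hmac, hashlib, struct

KEY = b'\x00' * 16  # shared secret, provisioned at attestation time (assumed)

def tag(addr, data, counter):
    msg = struct.pack('<QQ', addr, counter) + data
    return hmac.new(KEY, msg, hashlib.sha256).digest()[:8]  # truncated MAC

def verify(addr, data, counter, received_tag):
    # Constant-time comparison avoids leaking tag bytes via timing.
    return hmac.compare_digest(tag(addr, data, counter), received_tag)
```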
Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory
Growing computer system sizes and levels of integration have made memory reliability a primary concern, necessitating strong memory error protection. As such, large-scale systems typically employ error checking and correcting codes to trade redundant storage and bandwidth for increased reliability. While stronger memory protection will be needed to meet reliability targets in the future, it is undesirable to further increase the amount of storage and bandwidth spent on redundancy. We propose a novel family of single-tier ECC mechanisms called Bamboo ECC to simultaneously address the conflicting requirements of increasing reliability while maintaining or decreasing error protection overheads. Relative to the state-of-the-art single-tier error protection, Bamboo ECC codes have superior correction capabilities, all but eliminate the risk of silent data corruption, and can also increase redundancy at a fine granularity, enabling more adaptive graceful downgrade schemes. These strength, safety, and flexibility advantages translate to a significantly more reliable memory system. To demonstrate this, we evaluate a family of Bamboo ECC organizations in the context of conventional 72b and 144b DRAM channels and show the significant error coverage and memory lifespan improvements of Bamboo ECC relative to existing SEC-DED, chipkill-correct and double-chipkill-correct schemes.
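To illustrate the arithmetic behind pin-granularity symbol codes of this kind, the sketch below works through one plausible x4 configuration; the parameters are assumptions for illustration, not values taken from the paper.

```python
# Pin-granularity Reed-Solomon arithmetic: treat the bits a single data pin
# transfers over one burst as one symbol. An RS(n, k) code corrects
# floor((n-k)/2) symbol errors, or detects up to n-k with correction off.

pins, data_pins = 72, 64          # 72-bit ECC channel, 8 redundant pins
burst = 8                         # 8-bit symbols: one per pin per burst
n, k = pins, data_pins            # RS(72, 64) over GF(2^8)

t_correct = (n - k) // 2          # 4 correctable symbol errors
t_detect  = n - k                 # 8 detectable symbol errors

pins_per_x4_chip = 4              # a failed x4 chip corrupts 4 pins/symbols
print(t_correct >= pins_per_x4_chip)   # True: a dead x4 chip is correctable
```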
Near Data Acceleration with Concurrent Host Access
Near-data accelerators (NDAs) that are integrated with main memory have the potential for significant power and performance benefits. Fully realizing these benefits requires the large available memory capacity to be shared between the host and the NDAs in a way that permits regular memory access by some applications while accelerating others with an NDA, avoids copying data, enables collaborative processing, and simultaneously offers high performance for both host and NDA access. We identify and solve new challenges in this context: mitigating row-locality interference from host to NDAs, reducing the read/write-turnaround overhead caused by fine-grain interleaving of host and NDA requests, architecting a memory layout that supports both the locality required by NDAs and the sophisticated address interleaving needed for host performance, and supporting both packetized and traditional memory interfaces. We demonstrate our approach in a simulated system that consists of a multi-core CPU and NDA-enabled DDR4 memory modules. We show that our mechanisms enable effective and efficient concurrent access using a set of microbenchmarks, and then demonstrate the potential of the system for the important stochastic variance-reduced gradient (SVRG) algorithm.
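Since SVRG drives the final evaluation, a plain-NumPy sketch of the standard algorithm is shown below; the least-squares objective and all hyperparameters are assumptions for illustration. The periodic full-gradient pass over the entire dataset is the streaming access pattern that makes SVRG a natural NDA workload.

```python
# Stochastic variance-reduced gradient (SVRG) for least squares. A full
# gradient is recomputed at each snapshot and used to correct cheap
# per-sample gradients, reducing their variance.

import numpy as np

def svrg(X, y, lr=0.01, epochs=10):
    n, d = X.shape
    w = np.zeros(d)
    grad = lambda w, i: X[i] * (X[i] @ w - y[i])   # per-sample gradient
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = X.T @ (X @ w_snap - y) / n     # full pass: NDA-friendly
        for i in np.random.permutation(n):
            # Variance-reduced update: stochastic gradient plus correction.
            w -= lr * (grad(w, i) - grad(w_snap, i) + full_grad)
    return w
```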